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Downstream analyses of short-reads from next-generation sequencing platforms are often 
preceded by a pre-processing step that removes uncalled and wrongly called bases. 
Standard approaches rely on their associated base quality scores to retain the read or 
a portion of it when the score is above a predefined threshold. It is difficult to differentiate 
sequencing error from biological variation without a reference using quality scores. The 
effects of quality score based trimming have not been systematically studied in de 
novo transcriptome assembly. Using RNA-Seq data produced from lllumina, we teased 
out the effects of quality score based filtering or trimming on de novo transcriptome 
reconstruction. We showed that assemblies produced from reads subjected to different 
quality score thresholds contain truncated and missing transfrags when compared to 
those from untrimmed reads. Our data supports the fact that de novo assembling of 
untrimmed data is challenging for de Bruijn graph assemblers. However, our results 
indicates that comparing the assemblies from untrimmed and trimmed read subsets can 
suggest appropriate filtering parameters and enable selection of the optimum de novo 
transcriptome assembly in non-model organisms. 
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INTRODUCTION 

Ultra-high throughput or next generation sequencing (NGS) tech- 
nologies generates a considerable amount of data. This is desirable 
for single-nucleotide resolution of the genome and underlying 
expressed transcriptional units. Their application in sequencing 
the transcriptome is facilitated by parallel development of refer- 
ence free assembly algorithms that typically depend on the de 
Bruijn graph (Martin and Wang, 2011). This has resulted in 
an increase in the number of published transcriptome assem- 
blies for non-model organisms. However, de novo assembly is 
based on approximate computation, which is impeded by ran- 
dom variations in sampling (bias in reads) and sequencing errors. 
Sequencing errors introduces false /c-mers which increases the 
computational demands for graph resolution and the runtime 
of assembly algorithms. It is difficult to distinguish between 
sequencing errors from biological variation without a reference 
(Garber et al., 20 1 1 ), since variation becomes dominant with vol- 
ume of sequence data (Conway and Bromage, 2011). In addition, 
sampling methods aimed at enriching protein-coding (mRNA) 
transcripts are overwhelmed by bulk amounts of non-coding 
RNA (Cui etal., 2010) and immature mRNA with incompletely 
spliced introns (Garber et al., 20 1 1 ) . For researchers who outsource 
sequencing services, they do not have access to quality filtering 
tools embedded in NGS platforms (Cox etal., 2010). We can 
broadly identify two categories of pre-processing tools that address 
read usability: error correction and filtering/trimming algorithms 
which have emerged in response to low quality data. Error correc- 
tion approaches have been largely applied on genomic reads, e.g, 



Coral (Salmela and Schroder, 201 1) and Quake (Kelley et al., 2010) 
rely on multiple alignments in k-mer space and edit distance 
respectively to correct reads. Error correction maximizes the 
quantity of reads for downstream analyses but may reinforce 
errors and eliminate genuine reads with low frequency fc-mer 
(Martin and Wang, 2011). Only recently has error correction been 
applied to RNA-Seq data where the SEECER algorithm relies on 
a /c-mer profile Hidden Markov model (Le etal., 2013). How- 
ever, MacManes and Eisen (2013) compared error correction tools 
on RNA-Seq data and showed that Reptile (Yang etal, 2010) 
performed best with de novo transcriptome assembly of error 
corrected reads. Quality score based-trimming approaches are pre- 
dominantly used, but they often lead to significant loss of data (Le 
et al., 2013) and are extremely subjective. Reads are often trimmed 
in varying modes: ConDeTri (Smeds and Kiinstner, 2011) trims 
the reads from the 5', 3' or both ends over a defined number 
of bases (window) or per base and tools such as FASTX-toolkit 
(http://hannonlab.cshl.edu/fastx_toolkit/) will retain or discard 
an entire read after assessing quality over a fraction of bases 
in the read. On the other hand, read content artifacts such as 
sequencing adaptors and ribosomal RNA may need additional 
heuristics requirements for pre-processing. The effects of quality 
score based trimming and artifact removal have not been system- 
atically addressed with respect to the quality of de novo assembly 
derived transcribed fragments. Using NGS data, we compared 
reference free transcriptome assemblies derived from various cate- 
gories of quality trimmed reads: with and without artifact removal. 
Although, the RNA sample used is non-synthetic, we focused on 
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the attributes of the assemblies rather than the biological rele- 
vance of the RNA source. We report our findings and propose that 
caution must be exercised when applying quality filters prior to de 
novo assemblies and that comparing the assemblies of untrimmed 
and subcategories of trimmed reads could provide an optimal 
quality score threshold for each read. 

MATERIALS AND METHODS 
DATASETS 

Publicly available RNA-Seq data (SRR100067) and the genome 
assembly (accession AABX00000000) for wild type Neu- 
rospora crassa 74-OR23-1VA were obtained from the NCBI, 
http://www.ncbi.nlm.nih.gov/Traces/sra and http://www.ncbi.nlm. 
nih.gov/Traces/wgs respectively. Predicted coding sequences 
(CDS) for N. crassa were downloaded from http://fungidb.org 
release 2.0. In addition, the Venturia inaequalis draft genome ver- 
sion 1.0 (Hesse etal., 2013), two lanes of 100 bp paired-end and 
one lane of 75 bp single-end Illumina data were procured from 
RNA from a host free culture of V. inaequalis. The datasets and 
scripts can be accessed via ftp://ftp.sanbi.ac.za/quality.trimming 
and https://bitbucket.org/Kimbung/hsp.ratio. 

PRE-PROCESSING RNA-Seq DATA 

The raw RNA- Seq data from N. crassa was trimmed with a typically 
used minimum PHRED quality score threshold of 20 (Q20) and 
10 (Q10) using ConDeTri, with modification (Smeds and Ktin- 
stner, 2011) from the 3'-end to represent datasets one and two 
respectively. For V. inaequalis, we generated six categories of qual- 
ity trimmed or filtered reads as follows: (i) Low quality bases were 
removed at the 3'-end of each read with a PHRED quality score 
below 20 or 10 representing datasets one and two respectively, 
(ii) Potential remnants of adapter sequences were removed using 
FLEXBAR (Dodt etal, 2012) followed by trimming low quality 
bases with a PHRED quality score below 20 or 10 that repre- 
sents datasets three and four, (iii) adapter sequences only removed 
with FLEXBAR to create dataset five. A minimum read length 
of 36 bp was used for categories 1-5. A sixth category of pre- 
processed reads was obtained using the FASTX-toolkit by filtering 
reads where more than 80% of their bases have a PHRED quality 
less than 10. 

DE NOVO ASSEMBLY 

Reference free transcriptome reconstruction with the untrimmed 
and trimmed N. crassa datasets was performed with Trinity 
(release 2012-06-08; /c-mer 25; Grabherr etal, 2011). For com- 
parison, Oases (version 0.2.06; Schulz etal, 2012) was used to 
generate assemblies with various /c-mers (19-35). V. inaequalis 
datasets were assembled only with Trinity. In all cases, only default 
assembly parameters were used. Transfrags (TF) > 100 bp were 
kept for downstream analysis. 

COMPARING ASSEMBLIES 

To avoid inflation in alignment or assembly statistics, each 
assembly was checked for redundant TF using a PERL script to 
remove exact matches. We aligned the TF from N. crassa gen- 
erated with Q20 (one) and untrimmed reads to the genome 
with GMAP version 2013-10-04 (Wu and Watanabe, 2005). The 



following parameters described by Kupfer etal. (2004) were 
used: min-intron length = 20, max-intron length = 2000, total 
length = 5904. The total intron length per gene was estimated for 
N. crassa from http://fungi.ensembl.org release- 17. The aligned 
TFs were filtered at high stringency of 95% identity and 95% 
coverage. TFs from untrimmed reads that did not overlap with 
those from trimmed reads were verified against predicted CDS 
loci and recorded as missing annotations using in house PERL 
scripts for post-processing GMAP alignments. TF derived for 
the V. inaequalis untrimmed and trimmed (category one) reads 
were aligned to the V. inaequalis draft genome using exoner- 
ate version 2.2.0 (Slater and Birney, 2005) with the following 
parameters: model est2genome, maxintron = 5000. Coordinates 
for best alignment locations were considered and visualized with 
Gbrowse (http://gmod.org/wiki/GBrowse). The proteins from 
UniProt Knowledgebase (FUNGI) release 2013_02 (The UniProt 
Consortium: http://www.uniprot.org) were searched against each 
customizable databases of TF assembled from untrimmed and 
trimmed V. inaequalis reads with BLAST+ (Camacho et al., 2009). 
N. crassa TF produced with Trinity from both trimmed and 
untrimmed reads were searched against UniProt N. crassa pro- 
teins (£-value: 10e~ 10 ). Counts of number of unique high scoring 
segment pairs (HSP) were computed. The ratio of the length of the 
HSP to known Uniprot annotated proteins (hereafter referred to 
as HSP ratio) was generated with a series of in house PERL scripts 
and UNIX commands for each dataset. HSP ratio represents how 
well TF were reconstructed. Non-parametric analysis was applied 
to HSP ratios across read categories and the differences between 
the read pre-processing approaches was assessed post hoc using 
Agricolae package version 1.1-1 (de Mendiburu, 2012). 

RESULTS 

To investigate the potential side effects of quality based trim- 
ming and artifact removal on de novo transcriptome assembly, 
we analyzed datasets from a model (N. crassa) and non-model 
organism (V. inaequalis). A summary of read counts for each cate- 
gory of untrimmed and trimmed reads is shown in Table 1. More 
reads are removed when quality based trimming is preceded by 
adapter removal compared to doing the reverse. The percentage of 
trimmed reads ranged from 35 to 88%. Out of ~134 Gb V. inae- 
qualis untrimmed reads, quality trimming preceded with adapter 
removal retained the smallest amount of reads. When compar- 
ing assemblies from various categories of reads, we note that the 
number of unique TF from untrimmed reads is always higher 
than those from trimmed reads irrespective of the assembler and 
dataset used (Figure 1). For N. crassa TFs, this is much more 
profound at lower fc-mers. A similar trend is observed with the 
number of TFs, derived from untrimmed and trimmed reads that 
map to the same genomic loci. TFs produced with untrimmed 
reads recovered a higher number of known N. crassa proteins than 
those from the trimmed reads (Table 1). A total of 521 known 
gene loci were identified in N. crassa that overlapped with TFs 
derived from untrimmed but not trimmed reads. Transcriptome 
assembly statistics for each category of quality trimmed reads and 
the HSP ratios are shown in Table 1. The number of unique TFs is 
comparable among all assemblies for each organism. Untrimmed 
reads generated the largest number of TFs and identified the 
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Table 1 I Attributes of transfrags produced with Trinity. 
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largest numbers of known Uniprot proteins. Sequence similarity 
search identified 791 proteins that were present in all V. inae- 
quaiis assemblies. For N. crassa, 6218 proteins were common to all 
assemblies generated with Trinity. Kruskal-Wallis one-way analy- 
sis of variance suggest that quality score base pre-processing had 
a significant effect on TF quality in both N. crassa (p = 0.002999) 
and V. inaequaiis (p < 2.2e— 16) data. The mean and median HSP 
ratios for TF from untrimmed reads were slightly higher than 
those from trimmed reads for both N. crassa and V. inaequaiis. In 
addition, the untrimmed datasets has the least variation (Table 1). 
Multiple comparisons testing between HSP ratio is show in Table 1 . 
Post hoc analysis indicated that the more aggressive Q20 trimming, 
produced TF of inferior quality compare to the Q10. TF from Q10 
and the untrimmed reads yielded not significant different in HSP 
ratio. Groups with the same letters are not statistically different. 
Category one and two trimming strategies were significantly dif- 
ferent than the other five categories (p < 0.01), for V. inaequaiis. 
In both N. crassa and V. inaequaiis datasets, TF from untrimmed 
reads produced higher N50 values. Visual assessment of aligned 
V. inaequaiis TF from untrimmed and trimmed reads (category 
two), reveals missing TF and incomplete TF reconstruction in the 
latter as shown in Figure 2. 

DISCUSSION AND CONCLUSION 

In this study, we teased apart the effects of quality based trim- 
ming and artifact removal on the quality of de novo transcrip- 

tome assembly. Quality based trimming approaches are routinely observed that, adapter removal was more efficient when performed 

applied on reads generated from NGS platforms. Initial analy- prior to quality based-trimming. When reads are quality trimmed 

sis by Garg et al. (2011) suggested that this procedure improved prior to adapter removal, the sequences may become too short for 

de novo transcriptome assembly. However the choice of per base substring recognition. The higher median and mean HSP ratios 

quality score for trimming is subjective and there is no consensus and the number of Uniprot identified V. inaequaiis proteins, sug- 

on quality filtering/trimming thresholds since the quality score gest that TF derived proteins from assembled untrimmed reads 

distribution is non-uniform across samples and the technologies aligned with better quality than those from trimmed reads. Addi- 

for sequencing are constantly evolving. In addition, the study by tional support for this observation is revealed by the number of 

Garg et al. (20 1 1 ) employed a genome assembler which is not suit- missing annotations in TFs from trimmed N. crassa reads. This 

ably optimized for transcriptome reconstruction and this could corroborates anecdotal observation that quality trimming of reads 

have had an impact on the interpretation of their results. We can produce poor assemblies (Paszkiewicz and Studholme, 2010). 
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FIGURE 1 | Distribution of unique (solid circles) and overlapping 
(diamond shaped) transfrags (TF) from Neurospora crassa. TF from 
untrimmed and trimmed reads that map to common a genomic loci can be 
considered as overlapping. Below k-23, there is considerable difference in 
the number of unique and overlapping TF between the trimmed and 
untrimmed categories. TF from untrimmed and trimmed reads are 
represented in red and blue, respectively. 
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FIGURE 2 | A GBrowse snapshot of predicted genes and transf rags (TFs) for V. inaequalis. Ab initio gene predictions are shown in red. TFs produced by 
Trinity with untrimmed and trimmed (category one) reads are shown in orange and green, respectively. 



Untrimmed reads result in more contiguous assemblies, which is 
probably due to a larger number of paired reads that provide sup- 
port for connected edges in the de Bruijn graph. Quality trimming 
affects the quantity of usable reads and for each expression level 
there is a spectrum of parameters (typically k-mer) for optimal 
transcript assembly (Schulz et al, 2012). In non-model organism, 
there is an optimal number of reads balancing coverage and errors 
(Francis etal., 2013) and aggressive trimming or filtering strate- 
gies are likely to affect the coverage dynamics. By applying various 
trimming or filtering approaches, the number of reads appropriate 
for assembly is achievable when gaged correctly with an suitable 
metric such as HSP ratio for evaluating the assembly. While quality 
based trimming is routinely applied prior to de novo transcrip- 
tome assembly, our analyses suggest that this could lead to missing 
annotations and incomplete transcript reconstruction. As such, 
caution must be exercised given that quality score thresholds for 
read trimming or filtering are subjective. Promiscuous application 
of quality score based trimming and or filtering should be gaged 
and additional effective heuristics assessment of transcript recon- 
struction be applied for each trimming criteria. Furthermore, our 
analyses demonstrate that HSP ratio in addition to N50 can assist 
in selecting the optimal transcriptome assembly. 
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